AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.59)

Harder, Hans, Vishwasrao, Abhijeet, Guastoni, Luca, Vinuesa, Ricardo, Peitz, Sebastian

Efficient probabilistic surrogate modeling techniques for partially-observed large-scale dynamical systems

arXiv.org Artificial Intelligence, Nov-7-2025

This paper is concerned with probabilistic techniques for forecasting dynamical systems described by partial differential equations (for example, the Navier-Stokes equations). In particular, it investigates and compares various extensions to the flow matching paradigm that reduce the number of sampling steps. In this regard, it compares direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs, and rectified flows. Experiments are conducted on a set of challenging systems. The paper also addresses the challenge of directly predicting 2D slices of large-scale 3D simulations, paving the way for efficient inflow generation for solvers.
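As a rough illustration of why the number of sampling steps matters for flow-based models, and why straighter trajectories (the idea behind rectified flows) help, the minimal sketch below integrates two hand-written 1D velocity fields with explicit Euler steps. The function names, both toy velocity fields, and the toy mapping x1 = 2*x0 + 1 are assumptions made for illustration; none of them come from the paper.

```python
import math


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x


def straight_velocity(x, t):
    # Hypothetical field whose trajectories are straight lines in time,
    # x_t = x0 + t * (x0 + 1), ending at x1 = 2 * x0 + 1.
    return (x + 1.0) / (1.0 + t)


def curved_velocity(x, t):
    # Hypothetical field whose velocity changes along the trajectory,
    # x_t = x0 * exp(t), ending at x1 = e * x0.
    return x


x0 = 0.5
print(euler_sample(straight_velocity, x0, 1))    # one step on the straight field
print(euler_sample(curved_velocity, x0, 1))      # one step on the curved field
print(euler_sample(curved_velocity, x0, 1000))   # many steps on the curved field
print(math.e * x0)                               # exact endpoint of the curved field
```

On the straight field a single Euler step already lands on the exact endpoint, while the curved field needs many steps to approach its endpoint. Distillation approaches aim for a similar reduction in step count by training a student to reproduce the multi-step result directly.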

artificial intelligence, distillation, machine learning, (13 more...)

2511.04641

Country: Europe (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Modeling & Simulation (0.83)
Information Technology > Artificial Intelligence > Vision (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Neural Information Processing Systems, Aug-15-2025, 03:07:39 GMT

BiT: Robustly Binarized Multi-distilled Transformer

Inspired by the learnable bias proposed in ReActNet (Liu et al., 2020), we further propose elastic [...] In contrast to Bi-Attention proposed in BiBERT (Qin et al., 2021) that removes [...] We conduct meticulous experiments to compare these choices. The binary convolution between the weights and activations that are both binarized to {-1, 1} (i.e. [...] The GLUE benchmark (Wang et al., 2019) includes the following datasets: MNLI: Multi-Genre Natural Language Inference is an entailment classification task (Williams et al., [...] QQP: Quora Question Pairs is a paraphrase detection task. QNLI: Question Natural Language Inference (Wang et al., 2019) is a binary classification task [...] STS-B: The Semantic Textual Similarity Benchmark is a sentence pair classification task. The sentence pairs are sourced from online news sources (Dolan & Brockett, 2005).

activation, distillation, robustly binarized multi-distilled transformer, (13 more...)

Industry: Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)

arXiv.org Artificial Intelligence, Jun-2-2025

SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought

Li, Guanghao, Jiang, Wenhao, Chen, Mingfeng, Li, Yan, Yu, Hao, Dong, Shuting, Ren, Tao, Tang, Ming, Yuan, Chun

Chain of Thought (CoT) prompting improves the reasoning performance of large language models (LLMs) by encouraging step-by-step thinking. However, CoT-based methods depend on intermediate reasoning steps, which limits scalability and generalization. Recent work explores recursive reasoning, where LLMs reuse internal layers across iterations to refine latent representations without explicit CoT supervision. While promising, these approaches often require costly pretraining and lack a principled framework for how reasoning should evolve across iterations. We address this gap by introducing Flow Chain of Thought (Flow CoT), a reasoning paradigm that models recursive inference as a progressive trajectory of latent cognitive states. Flow CoT frames each iteration as a distinct cognitive stage, deepening reasoning across iterations without relying on manual supervision. To realize this, we propose SCOUT (Stepwise Cognitive Optimization Using Teachers), a lightweight fine-tuning framework that enables Flow CoT-style reasoning without the need for pretraining. SCOUT uses progressive distillation to align each iteration with a teacher of appropriate capacity, and a cross-attention-based retrospective module that integrates outputs from previous iterations while preserving the model's original computation flow. Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine-tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations, refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT, but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.

artificial intelligence, large language model, natural language, (16 more...)

2505.24181

Country:

Europe (0.28)
Asia (0.28)

Genre:

Research Report (1.00)
Workflow (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.89)

Gupta, Shivam, Karmalkar, Sushrut

Efficient Knowledge Distillation via Curriculum Extraction

arXiv.org Machine Learning, Mar-21-2025

Knowledge distillation is a technique used to train a small student network using the output generated by a large teacher network, and has many empirical advantages (Hinton et al., 2015). While the standard one-shot approach to distillation only uses the output of the final teacher network, recent work (Panigrahi et al., 2024) has shown that using intermediate checkpoints from the teacher's training process as an implicit "curriculum" for progressive distillation can significantly speed up training. However, such schemes require storing these checkpoints, and often require careful selection of the intermediate checkpoints to train on, which can be impractical for large-scale training. In this paper, we show that a curriculum can be extracted from just the fully trained teacher network, and that this extracted curriculum can give similar efficiency benefits to those of progressive distillation. Our extraction scheme is natural; we use a random projection of the hidden representations of the teacher network to progressively train the student network, before training using the output of the full network. We show that our scheme significantly outperforms one-shot distillation and achieves a performance similar to that of progressive distillation for learning sparse parities with two-layer networks, and provide theoretical guarantees for this setting. Additionally, we show that our method outperforms one-shot distillation even when using transformer-based architectures, both for sparse-parity learning and language modeling tasks.
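The extraction idea in the abstract above (random projections of the teacher's hidden representations as intermediate targets, followed by the teacher's real output) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the Gaussian projection with 1/sqrt(proj_dim) scaling, the per-layer seeds, and the shallow-to-deep ordering of stages are all hypothetical choices.

```python
import random


def make_projection(hidden_dim, proj_dim, seed=0):
    """Fixed Gaussian random projection matrix of shape (proj_dim, hidden_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / proj_dim ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(hidden_dim)]
            for _ in range(proj_dim)]


def project(matrix, hidden):
    """Apply the projection matrix to a single hidden vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in matrix]


def extracted_curriculum(teacher_hiddens, teacher_output, proj_dim, seed=0):
    """Return one distillation target per curriculum stage for a single input.

    Every stage but the last uses a random projection of one teacher hidden
    vector (assumed ordering: shallow to deep); the last stage uses the
    teacher's actual output.
    """
    stages = []
    for layer, hidden in enumerate(teacher_hiddens):
        matrix = make_projection(len(hidden), proj_dim, seed + layer)
        stages.append(project(matrix, hidden))
    stages.append(list(teacher_output))
    return stages


# Hypothetical teacher activations for one input: two hidden layers, one output.
hiddens = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0]]
stages = extracted_curriculum(hiddens, [0.9, 0.1], proj_dim=3)
print(len(stages))  # number of curriculum stages, including the final output stage
```

A student would then be trained against `stages[0]`, `stages[1]`, and so on in sequence. Only the fully trained teacher is needed, which is the practical advantage the abstract claims over storing intermediate checkpoints.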

distillation, large language model, machine learning, (21 more...)

2503.17494

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report (0.82)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Jan-26-2025, 13:44:52 GMT

Review for NeurIPS paper: Kernel Based Progressive Distillation for Adder Neural Networks

Weaknesses: The effectiveness of the kernel method, one of the claimed contributions, is not fully justified. As shown in Table 1, the kernel operation brings an insignificant gain on CIFAR-10 with a shallower network, ResNet-20. The gains (below 0.21%) seem insignificant and may be due to the stochastic initialization of the networks, suggesting that the proposed kernel scheme may not be as effective as advocated. I advise that a comparison on ImageNet with a deeper network (e.g., ResNet-50) be performed. The current experiments are not strong enough to support the claim that the proposed method is a competitive knowledge distillation method.

adder neural network, distillation, progressive distillation, (12 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.42)

Neural Information Processing Systems, Jan-26-2025, 13:44:46 GMT

Review for NeurIPS paper: Kernel Based Progressive Distillation for Adder Neural Networks

I believe that by bridging the gap between Adder NNs and CNNs, this work provides a considerable contribution, allowing Adder NNs to be considered among practical architectures and encouraging the community to research them further. In accordance with the reviewers, I think the proposed method is thoroughly investigated empirically. Please make sure to update the paper with all the results and answers that you have provided in your rebuttal.

adder neural network, neurips paper, progressive distillation, (2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

Garrepalli, Risheek, Mahajan, Shweta, Hayat, Munawar, Porikli, Fatih

DDIL: Improved Diffusion Distillation With Imitation Learning

arXiv.org Artificial Intelligence, Oct-15-2024

Diffusion models excel at generative modeling (e.g., text-to-image), but sampling requires multiple denoising network passes, limiting practicality. Efforts such as progressive distillation or consistency distillation have shown promise by reducing the number of passes at the expense of the quality of the generated samples. In this work, we identify covariate shift as one reason for the poor performance of multi-step distilled models, arising from compounding error at inference time. To address covariate shift, we formulate diffusion distillation within an imitation learning framework (DDIL) and enhance the training distribution for distilling diffusion models on both the data distribution (forward diffusion) and student-induced distributions (backward diffusion). Training on the data distribution helps to diversify the generations by preserving the marginal data distribution, and training on the student distribution addresses compounding error by correcting covariate shift. In addition, we adopt a reflected diffusion formulation for distillation and demonstrate improved performance and stable training across different distillation methods. We show that DDIL consistently improves on the baseline algorithms of progressive distillation (PD), latent consistency models (LCM), and Distribution Matching Distillation (DMD2).

artificial intelligence, distillation, machine learning, (14 more...)

2410.11971

Genre: Research Report (0.64)

Industry: Education (0.73)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)

Neural Information Processing Systems, Oct-10-2024, 18:44:07 GMT

Kernel Based Progressive Distillation for Adder Neural Networks

adder neural network, kernel, progressive distillation, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.62)

Panigrahi, Abhishek, Liu, Bingbin, Malladi, Sadhika, Risteski, Andrej, Goel, Surbhi

Progressive distillation induces an implicit curriculum

arXiv.org Artificial Intelligence, Oct-7-2024

Knowledge distillation leverages a teacher model to improve the training of a student model. A persistent challenge is that a better teacher does not always yield a better student, to which a common mitigation is to use additional supervision from several "intermediate" teachers. One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher. Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning. This curriculum is available only through the intermediate checkpoints but not the final converged one, and imparts both empirical acceleration and a provable sample complexity benefit to the student. We then extend our investigation to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world pre-training datasets (Wikipedia and Books). Through probing the teacher model, we identify an analogous implicit curriculum where the model progressively learns features that capture longer context. Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups.
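The mechanics of learning from successive teacher checkpoints, as described in the abstract above, can be sketched as a supervision schedule plus a soft-label loss. This is a minimal sketch, not the paper's training code: the function names, the equal-length phases, and the plain cross-entropy against teacher soft labels are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against teacher soft labels."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def checkpoint_for_step(step, total_steps, num_checkpoints):
    """Index of the teacher checkpoint that supervises the student at `step`.

    Assumes training is split into equal phases, one per checkpoint, ordered
    from the earliest checkpoint to the final converged teacher.
    """
    phase_length = math.ceil(total_steps / num_checkpoints)
    return min(step // phase_length, num_checkpoints - 1)


# Which of 4 hypothetical checkpoints supervises each of these training steps?
for step in (0, 24, 25, 99):
    print(step, checkpoint_for_step(step, total_steps=100, num_checkpoints=4))
```

One-shot distillation corresponds to `num_checkpoints=1`, where the converged teacher supervises every step. The abstract's claim is that the earlier checkpoints carry an implicit curriculum that the converged teacher alone does not provide.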

checkpoint, distillation, progressive distillation, (14 more...)

2410.05464

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(7 more...)

Genre: Research Report > New Finding (0.67)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
(3 more...)